24 research outputs found

An efficient algorithm for discovering frequent subgraphs

Authors: George Karypis, Michihiro Kuramochi
Publication date: 01/01/2002

Abstract: Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to non-traditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework that is assumed by these algorithms cannot be used to effectively model the datasets in these domains. An alternate way of modeling the objects in these datasets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is as that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph datasets. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 200,000 graph transactions and scales linearly with respect to the size of the dataset. Index Terms: Data mining, scientific datasets, frequent pattern discovery, chemical compound datasets.

CiteSeerX

University of Minnesota Digital Conservancy
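
The FSG abstract above frames frequent pattern discovery as finding subgraphs that occur in many graph transactions. The following is a minimal sketch of that support-counting idea for the simplest possible pattern, a single labeled edge. It is not the FSG algorithm itself; the function name `frequent_edges`, the `(vertex_labels, edges)` graph representation, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Return single-edge patterns found in at least `min_support` graphs.

    Hypothetical representation: each graph is a pair (vertex_labels, edges),
    where vertex_labels maps vertex id -> label and edges is an iterable of
    (u, v, edge_label) triples for an undirected graph.
    """
    support = Counter()
    for vertex_labels, edges in graphs:
        seen = set()
        for u, v, edge_label in edges:
            # Sort the endpoint labels so C-O and O-C are the same pattern.
            a, b = sorted((vertex_labels[u], vertex_labels[v]))
            seen.add((a, edge_label, b))
        # A graph contributes at most once to the support of each pattern.
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}


# Toy "chemical" transactions (made-up data).
graphs = [
    ({0: "C", 1: "O", 2: "C"}, [(0, 1, "double"), (0, 2, "single")]),
    ({0: "C", 1: "C", 2: "N"}, [(0, 1, "single"), (1, 2, "single")]),
    ({0: "O", 1: "C"}, [(0, 1, "double")]),
]
result = frequent_edges(graphs, min_support=2)
```

Here support is counted per graph transaction rather than per occurrence, which matches the "occur frequently over the entire set of graphs" formulation. Larger patterns would additionally need subgraph isomorphism checks, which this sketch leaves out.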

In this paper we study the problem of classifying chemical compound datasets. We present a sub-structure-based classification algorithm that decouples the sub-structure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric sub-structures present in the dataset. The advantage of our approach is that during classification model construction, all relevant sub-structures are available, allowing the classifier to intelligently select the most discriminating ones. The computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Our experimental evaluation on eight different classification problems shows that our approach is computationally scalable and outperforms existing schemes by 10% to 35%, on average.

CiteSeerX

University of Minnesota Digital Conservancy

GREW—A Scalable Frequent Subgraph Discovery Algorithm

Authors: George Karypis, Michihiro Kuramochi
Publication date: 01/01/2003

Existing algorithms that mine graph datasets to discover patterns corresponding to frequently occurring subgraphs can operate efficiently on graphs that are sparse, contain a large number of relatively small connected components, have vertices with low and bounded degrees, and contain well-labeled vertices and edges. However, there are a number of applications that lead to graphs that do not share these characteristics, for which these algorithms become highly unscalable. In this paper we propose a heuristic algorithm called GREW to overcome the limitations of existing complete or heuristic frequent subgraph discovery algorithms. GREW is designed to operate on a large graph and to find patterns corresponding to connected subgraphs that have a large number of vertex-disjoint embeddings. Our experimental evaluation shows that GREW is efficient, can scale to very large graphs, and finds non-trivial patterns that cover large portions of the input graph and the lattice of frequent patterns.

CiteSeerX

University of Minnesota Digital Conservancy
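
The GREW abstract above measures a pattern by how many vertex-disjoint embeddings it has in one large graph. As a rough illustration of what "vertex-disjoint" means, the sketch below greedily picks embeddings that share no vertex. It is not GREW's procedure; the function name and the representation of an embedding as a `frozenset` of vertex ids are assumptions, and a greedy pass only gives a lower bound on the true maximum.

```python
def greedy_disjoint_embeddings(embeddings):
    """Greedily select embeddings that share no vertex.

    Hypothetical representation: each embedding is a frozenset of the
    vertex ids it occupies in the input graph. Ids must be mutually
    comparable so the scan order is deterministic.
    """
    used = set()
    chosen = []
    for emb in sorted(embeddings, key=sorted):
        if used.isdisjoint(emb):
            chosen.append(emb)
            used |= emb
    return chosen


# Four overlapping embeddings of some 3-vertex pattern (made-up data).
embeddings = [
    frozenset({1, 2, 3}),
    frozenset({3, 4, 5}),
    frozenset({6, 7, 8}),
    frozenset({5, 6, 9}),
]
chosen = greedy_disjoint_embeddings(embeddings)
```

Counting only disjoint embeddings keeps one dense region of the graph from inflating a pattern's frequency through many overlapping matches.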

Discovering Frequent Geometric Subgraphs

Authors: George Karypis, Michihiro Kuramochi
Publication date: 21/10/2004

Data mining-based analysis methods are increasingly being applied to datasets derived from science and engineering domains that model various physical phenomena and objects. In many of these datasets, a key requirement for their effective analysis is the ability to capture the relational and geometric characteristics of the underlying entities and objects. Geometric graphs, by modeling the various physical entities and their relationships with vertices and edges, provide a natural method to represent such datasets. In this paper we present gFSG, a computationally efficient algorithm for finding frequent patterns corresponding to geometric subgraphs in a large collection of geometric graphs. gFSG is able to discover geometric subgraphs that can be rotation, scaling, and translation invariant, and it can accommodate inherent errors in the coordinates of the vertices. We evaluated its performance using a large database of over 20,000 chemical structures, and our results show that it requires relatively little time, can accommodate low support values, and scales linearly with the number of transactions.

University of Minnesota Digital Conservancy
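
The gFSG abstract above asks for matching that is invariant to rotation, scaling, and translation while tolerating coordinate errors. One simple way to illustrate such invariance, not taken from the paper, is to compare sorted pairwise distances after dividing by the largest one. The function names and the tolerance `tol` below are assumptions. The check ignores vertex and edge labels, is also invariant to reflection, and is a necessary rather than sufficient condition for two shapes to match.

```python
import math
from itertools import combinations


def distance_signature(points):
    """Sorted pairwise distances divided by the largest distance.

    Assumes at least two points that are not all identical.
    """
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    longest = dists[-1]
    return [d / longest for d in dists]


def roughly_same_shape(points_a, points_b, tol=0.01):
    """Compare two point sets up to rotation, translation, uniform scaling."""
    if len(points_a) != len(points_b):
        return False
    sig_a = distance_signature(points_a)
    sig_b = distance_signature(points_b)
    return all(abs(x - y) <= tol for x, y in zip(sig_a, sig_b))


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Same triangle rotated 90 degrees, doubled, shifted, with slight noise.
moved = [(5.01, 5.0), (5.0, 7.0), (3.0, 5.0)]
# A differently proportioned triangle.
other = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]

same = roughly_same_shape(triangle, moved)
different = roughly_same_shape(triangle, other)
```

Normalizing by the longest distance removes scale, and using distances instead of raw coordinates removes rotation and translation; `tol` plays the role of the allowed coordinate error.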

Discovering Geometric Frequent Subgraphs

Authors: George Karypis, Michihiro Kuramochi
Publication date: 17/06/2002

As data mining techniques are being increasingly applied to non-traditional domains, existing approaches for finding frequent itemsets cannot be used as they cannot model the requirements of these domains. An alternate way of modeling the objects in these data sets is to use a graph to model the database objects. Within that model, the problem of finding frequent patterns becomes that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm for finding frequent geometric subgraphs in a large collection of geometric graphs. Our algorithm is able to discover geometric subgraphs that can be translation, rotation, and scaling invariant, and it can accommodate inherent errors in the coordinates of the vertices. We evaluated the performance of the algorithm using a large database of over 20,000 real two-dimensional chemical structures, and our experimental results show that our algorithm requires relatively little time, can accommodate low support values, and scales linearly with the number of transactions.

University of Minnesota Digital Conservancy